fix: get ensembl-reference wrapper to download more than one chromosome #3432

Open
wants to merge 6 commits into base: master

Conversation

dlaehnemann (Contributor) commented Nov 6, 2024

Currently, only the first chromosome from the list is downloaded.
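
A minimal sketch of how this kind of bug can arise, assuming the download loop stops at the first suffix that succeeds regardless of whether individual chromosomes were requested. The function names, the `fetch` callable, and the suffix strings are hypothetical stand-ins for illustration, not the wrapper's actual code.

```python
def download_buggy(suffixes, fetch):
    """Stop at the first suffix that downloads, even for chromosome lists."""
    downloaded = []
    for suffix in suffixes:
        if fetch(suffix):
            downloaded.append(suffix)
            break  # unconditional: later chromosomes are never requested
    return downloaded


def download_guarded(suffixes, fetch, chromosome):
    """Only stop early when no individual chromosomes were requested."""
    downloaded = []
    for suffix in suffixes:
        if fetch(suffix):
            downloaded.append(suffix)
            if not chromosome:
                break  # fallback candidates: the first available one is enough
    return downloaded
```

With two chromosome suffixes and a `fetch` that always succeeds, `download_buggy` returns only the first suffix, while `download_guarded` returns both.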

QC

While the contribution guidelines are more extensive, please particularly ensure that:

  • test.py was updated to call any added or updated example rules in a Snakefile
  • input: and output: file paths in the rules can be chosen arbitrarily
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:)
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to
  • the meta.yaml contains a link to the documentation of the respective tool or command under url:
  • conda environments use a minimal amount of channels and packages, in recommended ordering

Summary by CodeRabbit

  • New Features

    • Enhanced error messages for clarity when selecting individual chromosomes.
    • Improved control flow for downloading sequence data, allowing for a more efficient exit from the loop.
  • Bug Fixes

    • Updated error messages for unsuccessful downloads to provide more specific feedback.
    • Maintained existing error handling for invalid datatype values to ensure robust performance.

coderabbitai bot (Contributor) commented Nov 6, 2024

Walkthrough

The changes in this pull request involve modifications to the bio/reference/ensembl-sequence/wrapper.py file. The error messages related to invalid datatype for selecting a single chromosome have been clarified. The variable success is now initialized within the loop iterating over suffixes, and the control flow has been adjusted to break the loop after the first successful download attempt if chromosome is specified. The overall logic for determining suffixes based on datatype remains unchanged, and error handling continues to raise a ValueError.

Changes

File Path | Change Summary
bio/reference/ensembl-sequence/wrapper.py | Enhanced error messages for invalid datatype; initialized success within suffix loop; adjusted control flow for downloading sequence data.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script
    participant Curl

    User->>Script: Execute script
    Script->>Script: Determine suffixes based on datatype
    Script->>Script: Check if chromosome is specified
    alt Chromosome not specified
        Script->>Script: Attempt download
        Script->>Curl: Execute download command
        Script->>Script: Break loop after first success
    else Chromosome specified
        Script->>Script: Iterate over suffixes
        Script->>Curl: Execute download command for each suffix
    end
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 34a454a and 8d30eea.

📒 Files selected for processing (1)
  • bio/reference/ensembl-sequence/wrapper.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
bio/reference/ensembl-sequence/wrapper.py (2)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.


Pattern **/wrapper.py: Do not complain about use of undefined variable called snakemake.

🔇 Additional comments (1)
bio/reference/ensembl-sequence/wrapper.py (1)

Line range hint 19-23: Enhance input validation and error handling.

Consider the following improvements for better robustness:

  1. Validate chromosome values against a list of valid chromosomes
  2. Provide more specific error messages for invalid chromosomes
  3. Consider adding a check for output file size after each download
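
A rough sketch of the first and third suggestions, under stated assumptions: the set of valid chromosome names would have to come from somewhere (here it is a hypothetical argument), and because the wrapper appends each download to the same output file, a size check is modelled as "the file grew since before the download". Both helper names are made up for illustration.

```python
import os


def invalid_chromosomes(requested, valid):
    """Return the requested chromosome names that are not in the valid set."""
    return [c for c in requested if c not in valid]


def grew_after_download(path, size_before):
    """True if the file at `path` exists and is larger than `size_before` bytes."""
    return os.path.exists(path) and os.path.getsize(path) > size_before
```

A caller would record `os.path.getsize(path)` (or 0 if the file does not exist yet) before each download and call `grew_after_download` afterwards.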

Let's verify the valid chromosome values for the specified species:

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
bio/reference/ensembl-sequence/wrapper.py (2)

Line range hint 58-66: Add success logging for better debugging.

While the success flag initialization is correct, adding logging would improve debugging capabilities.

    shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}")
    success = True
+    shell.logger.info(f"Successfully downloaded sequence from {url}")
🧰 Tools
🪛 Ruff

47-48: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)


53-53: Undefined name snakemake

(F821)


Line range hint 57-70: Consider enhancing multiple chromosome download implementation.

To better support multiple chromosome downloads, consider these improvements:

  1. Track successful and failed downloads separately
  2. Provide summary of download results
  3. Handle partial successes appropriately

Example approach:

successful_downloads = []
failed_downloads = []
for suffix in suffixes:
    url = f"{url_prefix}.{suffix}"
    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
        shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}")
        successful_downloads.append(suffix)
        shell.logger.info(f"Successfully downloaded {suffix}")
    except sp.CalledProcessError:
        failed_downloads.append(suffix)
        shell.logger.warning(f"Failed to download {suffix}")
        continue

if successful_downloads:
    shell.logger.info(f"Successfully downloaded: {', '.join(successful_downloads)}")
if failed_downloads:
    shell.logger.warning(f"Failed to download: {', '.join(failed_downloads)}")
    if not successful_downloads:
        exit(1)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between dfdd743 and 23b8788.

📒 Files selected for processing (1)
  • bio/reference/ensembl-sequence/wrapper.py (2 hunks)
🧰 Additional context used
🪛 Ruff
bio/reference/ensembl-sequence/wrapper.py

53-53: Undefined name snakemake

(F821)

🔇 Additional comments (2)
bio/reference/ensembl-sequence/wrapper.py (2)

50-50: LGTM: Improved error message clarity.

The error message is now more precise and better explains the requirements for chromosome selection.


68-69: Previous critical issue still applies.

The break statement still prevents downloading multiple chromosomes, which contradicts the PR objective.

for suffix in suffixes:
    success = False
fgvieira (Collaborator), Nov 7, 2024

Shouldn't this be outside the loop to check if at least one suffix was successful? This way it will only check the last suffix, no?

Contributor Author

If it is outside the loop and we are requesting multiple chromosomes, it will turn True on the first working chromosome and then stay that way. So we will get neither debugging output nor an error if any of the other chromosomes is unavailable. For the chromosomes case, we should therefore reset it for every suffix in suffixes. In the other case, where we check whether "dna.primary_assembly.fa.gz" or "dna.toplevel.fa.gz" is available, the loop over suffixes breaks right after setting success = True; otherwise success is left at False after the last suffix runs into the except:.

Collaborator

But this way it only checks if the last chromosome was available, no?

Contributor Author

Ah yes, you are right. Very good catch. Let me think about what the best solution is...

Collaborator

Maybe moving the error checking directly to the try/except?
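
One way to read this suggestion, as a sketch rather than the code that was eventually committed: handle the failure inside the except branch, so that a missing chromosome aborts immediately, while the fallback case (trying alternative files in turn) only fails once every candidate has been tried. The `download_all` name and the `fetch` callable are hypothetical; in the wrapper the download is a `shell(...)` call around curl.

```python
import subprocess as sp
import sys


def download_all(url_prefix, suffixes, chromosome, fetch):
    """Fetch every suffix if chromosomes were requested, else the first available one."""
    fetched = []
    for suffix in suffixes:
        url = f"{url_prefix}.{suffix}"
        try:
            fetch(url)  # assumed to raise CalledProcessError on failure
        except sp.CalledProcessError:
            if chromosome:
                # Every requested chromosome is required, so fail right here.
                print(f"Unable to download the requested chromosome sequence: {url}", file=sys.stderr)
                raise
            continue  # fallback case: try the next candidate file
        fetched.append(url)
        if not chromosome:
            break  # one candidate file is enough
    if not fetched:
        raise ValueError(f"No reference sequence could be downloaded from {url_prefix}.")
    return fetched
```

Checking in the except branch avoids the problem discussed above: no flag has to survive across iterations, so a failure of an early chromosome cannot be masked by the success of a later one.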

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 23b8788 and 19536bc.

📒 Files selected for processing (1)
  • bio/reference/ensembl-sequence/wrapper.py (1 hunks)
🧰 Additional context used
🪛 Ruff
bio/reference/ensembl-sequence/wrapper.py

53-53: Undefined name snakemake

(F821)

🔇 Additional comments (2)
bio/reference/ensembl-sequence/wrapper.py (2)

50-50: LGTM: Improved error message clarity.

The error message is now more precise and user-friendly.


75-76: LGTM: Correct handling of non-chromosome case.

The break statement is correctly guarded to only exit early when not downloading specific chromosomes.

Comment on lines +64 to +70
if chromosome:
    print(
        f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
        file=sys.stderr,
    )
    break
else:
Contributor

⚠️ Potential issue

Critical: Remove break statement and use proper logging.

The current implementation has two issues:

  1. The break statement prevents downloading multiple chromosomes
  2. Direct print statements should use Snakemake's logging mechanism

Apply these changes:

 if chromosome:
-    print(
-        f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
-        file=sys.stderr,
-    )
-    break
+    shell.logger.error(
+        f"Unable to download chromosome sequence from: {url_prefix}.{suffix}"
+    )
+    continue
 else:
     continue

for suffix in suffixes:
    success = False
Contributor

🛠️ Refactor suggestion

Track download success for each chromosome separately.

The current implementation resets the success flag for each suffix, which could mask failures of individual chromosome downloads. Consider tracking success per chromosome:

-success = False
+successes = set()  # Track successful downloads

Then update the success tracking after download:

-success = True
+successes.add(suffix)  # Record successful download

And modify the final check:

-if not success:
+if not successes:

This change will help identify which specific chromosomes failed to download.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +79 to +87
if not chromosome:
    if len(suffixes) > 1:
        url = f"{url_prefix}.[{'|'.join(suffixes)}]"
    else:
        url = f"{url_prefix}.{suffixes[0]}"
    print(
        f"Unable to download the requested reference sequence data from Ensembl at: {url}.",
        file=sys.stderr,
    )
Contributor

🛠️ Refactor suggestion

Improve error reporting with Snakemake logging.

The error reporting can be improved by:

  1. Using Snakemake's logging mechanism
  2. Providing more detailed feedback about failed chromosomes

-if not success:
+if not successes:
     if not chromosome:
         if len(suffixes) > 1:
             url = f"{url_prefix}.[{'|'.join(suffixes)}]"
         else:
             url = f"{url_prefix}.{suffixes[0]}"
-        print(
+        shell.logger.error(
             f"Unable to download the requested reference sequence data from Ensembl at: {url}.",
-            file=sys.stderr,
         )
+else:
+    failed = set(suffixes) - successes
+    if failed:
+        shell.logger.error(
+            f"Failed to download the following chromosomes: {', '.join(failed)}"
+        )

-print(
+shell.logger.error(
     "Please check whether above URL is currently available (might be a temporal server issue). "
     "Apart from that, did you check that this combination of species, build, and release is actually provided?",
-    file=sys.stderr,
)

Committable suggestion skipped: line range outside the PR's diff.
